No fast exponential deviation inequalities for the progressive mixture rule

Author

  • Jean-Yves Audibert
Abstract

We consider the learning task of predicting as well as the best function in a finite reference set G, up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, it is known that, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), the progressive mixture rule ĝ satisfies

    \mathbb{E}\, R(\hat{g}) \;\le\; \min_{g \in G} R(g) + C\, \frac{\log |G|}{n},    (1)

where n denotes the size of the training set, \mathbb{E} denotes the expectation w.r.t. the training set distribution, and C is a positive constant. This work mainly shows that for any training set size n there exist ε > 0, a reference set G and a probability distribution generating the data such that, with probability at least ε,

    R(\hat{g}) \;\ge\; \min_{g \in G} R(g) + c\, \sqrt{\frac{\log(|G|\, \varepsilon^{-1})}{n}},

where c is a positive constant. In other words, surprisingly, for an appropriate reference set G, the deviation convergence rate of the progressive mixture rule is only of order 1/√n, while its expectation convergence rate is of order 1/n. The same conclusion holds for the progressive indirect mixture rule. This work also emphasizes the suboptimality of algorithms based on penalized empirical risk minimization on G.

1 Setup and notation

We assume that we observe n input-output pairs, denoted Z_1 = (X_1, Y_1), ..., Z_n = (X_n, Y_n), each drawn independently from the same unknown distribution P. The input and output spaces are denoted respectively X and Y, so that P is a probability distribution on the product space Z ≜ X × Y. The quality of a (prediction) function g : X → Y is measured by the risk (or generalization error)

    R(g) = \mathbb{E}_{(X,Y) \sim P}\, \ell[Y, g(X)],

where ℓ[Y, g(X)] denotes the loss (possibly infinite) incurred by predicting g(X) when the true output is Y. We work under the following assumptions on the data space and the loss function ℓ : Y × Y → ℝ ∪ {+∞}.

Main assumptions. The input space is assumed to be infinite: |X| = +∞. The output space is a non-trivial (i.e. infinite) interval of ℝ, symmetric w.r.t. some a ∈ ℝ: for any y ∈ Y, we have 2a − y ∈ Y. The loss function is

– uniformly exp-concave: there exists λ > 0 such that for any y ∈ Y, the set {y′ ∈ ℝ : ℓ(y, y′) < +∞} is an interval containing a on which the function y′ ↦ e^{−λ ℓ(y, y′)} is concave;
– symmetric: for any y_1, y_2 ∈ Y, ℓ(y_1, y_2) = ℓ(2a − y_1, 2a − y_2);
– admissible: for any y, y′ ∈ Y ∩ ]a; +∞[, ℓ(y, 2a − y′) > ℓ(y, y′);
– well behaved at the center: for any y ∈ Y ∩ ]a; +∞[, the function ℓ_y : y′ ↦ ℓ(y, y′) is twice continuously differentiable on a neighborhood of a and ℓ′_y(a) < 0.

These assumptions imply that

– Y necessarily has one of the following forms: ]−∞; +∞[, [a − ζ; a + ζ] or ]a − ζ; a + ζ[ for some ζ > 0;
– for any y ∈ Y, by the exp-concavity assumption, the function ℓ_y : y′ ↦ ℓ(y, y′) is convex on the interval on which it is finite. Indeed, if ξ denotes the function e^{−λ ℓ_y}, Jensen's inequality gives, for any probability distribution,

    \mathbb{E}\, \ell_y(Y) = \mathbb{E}\left(-\tfrac{1}{\lambda} \log \xi(Y)\right) \ge -\tfrac{1}{\lambda} \log \mathbb{E}\, \xi(Y) \ge -\tfrac{1}{\lambda} \log \xi(\mathbb{E} Y) = \ell_y(\mathbb{E} Y).

  As a consequence, the risk R is also a convex function (on the convex set of prediction functions for which it is finite).

These assumptions are motivated by the fact that they are satisfied in the following settings:

– least square loss with bounded outputs: Y = [y_min; y_max] and ℓ(y_1, y_2) = (y_1 − y_2)². Then a = (y_min + y_max)/2, and one may take λ = 1/[2(y_max − y_min)²].
– entropy loss: Y = [0; 1] and ℓ(y_1, y_2) = y_1 log(y_1/y_2) + (1 − y_1) log[(1 − y_1)/(1 − y_2)]. Note that ℓ(0, 1) = ℓ(1, 0) = +∞. Then a = 1/2, and one may take λ = 1.
– exponential (or AdaBoost) loss: Y = [−y_max; y_max] and ℓ(y_1, y_2) = e^{−y_1 y_2}. Then a = 0, and one may take λ = e^{−y_max²}.
– logit loss: Y = [−y_max; y_max] and ℓ(y_1, y_2) = log(1 + e^{−y_1 y_2}). Then a = 0, and one may take λ = e^{−y_max²}.
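As a quick sanity check (not from the paper), the stated exp-concavity constants for these four losses can be verified numerically: for each loss, one tests on a grid that y′ ↦ e^{−λ ℓ(y, y′)} has non-positive discrete second differences. The Python sketch below does exactly that; the output bound y_max = 1, the grids and the tolerance are illustrative choices.

```python
import numpy as np

# Hypothetical numerical check (not from the paper): verify on a grid
# that y' -> exp(-lam * loss(y, y')) is concave, i.e. that its discrete
# second differences f(y'+h) - 2 f(y') + f(y'-h) are all <= 0.

def exp_concave_on_grid(loss, lam, y_grid, yp_grid, tol=1e-9):
    """True if y' -> exp(-lam * loss(y, y')) has non-positive second
    differences on yp_grid for every y in y_grid."""
    for y in y_grid:
        f = np.exp(-lam * loss(y, yp_grid))
        if np.any(f[2:] - 2.0 * f[1:-1] + f[:-2] > tol):
            return False
    return True

y_max = 1.0                              # illustrative output bound
sym = np.linspace(-y_max, y_max, 201)    # grid for Y = [-y_max; y_max]
unit = np.linspace(0.01, 0.99, 201)      # grid for Y = [0; 1], staying away
                                         # from the endpoints where the
                                         # entropy loss is infinite

checks = {
    # least squares: lambda = 1 / [2 (y_max - y_min)^2]
    "least squares": (lambda y, yp: (y - yp) ** 2,
                      1.0 / (2.0 * (2.0 * y_max) ** 2), sym, sym),
    # entropy loss: lambda = 1
    "entropy": (lambda y, yp: y * np.log(y / yp)
                + (1 - y) * np.log((1 - y) / (1 - yp)),
                1.0, unit, unit),
    # exponential loss: lambda = exp(-y_max^2)
    "exponential": (lambda y, yp: np.exp(-y * yp),
                    np.exp(-y_max ** 2), sym, sym),
    # logit loss: lambda = exp(-y_max^2)
    "logit": (lambda y, yp: np.log1p(np.exp(-y * yp)),
              np.exp(-y_max ** 2), sym, sym),
}

for name, (loss, lam, yg, ypg) in checks.items():
    print(f"{name}: exp-concave on grid -> {exp_concave_on_grid(loss, lam, yg, ypg)}")
```

All four checks pass; for the least square, exponential and logit losses the second differences approach zero at the corners of the grid, which is consistent with the stated λ being the largest admissible constant.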
Progressive indirect mixture rule. Let G be a finite reference set of prediction functions. Under the previous assumptions, the only known algorithms satisfying (1) are the progressive indirect mixture rules defined below. For any i ∈ {0, ..., n}, the cumulative loss suffered by the prediction function g on the first i input-output pairs is

    \Sigma_i(g) \;\triangleq\; \sum_{j=1}^{i} \ell[Y_j, g(X_j)].
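The excerpt stops before the rule itself is spelled out, so the following is only a minimal sketch, assuming the standard definition of the progressive mixture rule from the aggregation literature: predict with the average, over i = 0, ..., n, of the Gibbs mixtures whose weights on G are proportional to e^{−λ Σ_i(g)}. The function name and array layout are illustrative, not from the paper.

```python
import numpy as np

# A minimal sketch of the progressive mixture rule, assuming its standard
# definition: average over i = 0..n the Gibbs mixtures whose weights on G
# are proportional to exp(-lam * Sigma_i(g)).

def progressive_mixture_predict(train_losses, test_preds, lam):
    """
    train_losses : (|G|, n) array; entry (k, j) = loss of g_k on (X_j, Y_j),
                   in the order the training pairs were observed.
    test_preds   : (|G|, m) array; entry (k, t) = g_k(x_t) for m test inputs.
    lam          : exp-concavity parameter lambda of the loss.
    Returns the (m,) array of predictions of the progressive mixture rule.
    """
    G, n = train_losses.shape
    # Sigma_i(g) for i = 0, ..., n, with Sigma_0 = 0 (uniform initial weights).
    sigma = np.concatenate([np.zeros((G, 1)),
                            np.cumsum(train_losses, axis=1)], axis=1)
    out = np.zeros(test_preds.shape[1])
    for i in range(n + 1):
        logw = -lam * sigma[:, i]
        logw -= logw.max()           # stabilize before exponentiating
        w = np.exp(logw)
        w /= w.sum()                 # Gibbs weights on G after i pairs
        out += w @ test_preds        # prediction of the i-th Gibbs mixture
    return out / (n + 1)             # Cesaro average over i = 0, ..., n
```

With the least square loss, for instance, train_losses[k, j] would be (Y_j − g_k(X_j))² and lam the constant 1/[2(y_max − y_min)²] from the first setting above. Averaging over i, rather than using only the final Gibbs mixture, is what yields the O(log|G|/n) bound (1) in expectation.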

Similar resources

Progressive mixture rules are deviation suboptimal

We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule ĝ ...

Deviation Inequalities for Exponential Jump-diffusion Processes

In this note we obtain deviation inequalities for the law of exponential jump-diffusion processes at a fixed time. Our method relies on convex concentration inequalities obtained by forward/backward stochastic calculus. In the pure jump and pure diffusion cases, it also improves on classical results obtained by direct application of Gaussian and Poisson bounds.

Exact hypothesis testing and confidence interval for mean of the exponential distribution under Type-I progressive hybrid censoring

Censored samples are discussed in experiments of life-testing, i.e. whenever the experimenter does not observe the failure times of all units placed on a life test. In recent years, inference based on censored sampling has been considered for the parameters of various distributions such as normal, exponential, gamma, Rayleigh, Weibull, log normal, inverse Gaussian, ...

Deviation Inequalities on Largest Eigenvalues

In these notes, we survey developments on the asymptotic behavior of the largest eigenvalues of random matrix and random growth models, and describe the corresponding known non-asymptotic exponential bounds. We then discuss some elementary and accessible tools from measure concentration and functional analysis to reach some of these quantitative inequalities at the correct small deviation rate ...

Publication date: 2007